2. Logistic Regression

Aim

    To understand and implement the Logistic Regression algorithm for binary classification problems, and to analyze how it models the probability of an outcome with the logistic (sigmoid) function and applies a decision threshold to assign each observation to a class.

Understand the Logistic Regression Algorithm Before You Begin

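Overview: Logistic Regression adapts the linear regression model to predict categorical outcomes, especially binary classes (such as Yes/No or 0/1). Unlike linear regression, it does not predict the outcome value directly: it models the log-odds (logit) of the event as a linear function of the predictors and converts that value into a probability between 0 and 1 with the logistic (sigmoid) function, which makes it suitable for classification tasks.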
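
In symbols, the model can be written as follows. The notation (feature vector x, weight vector w, bias b) is assumed here for illustration and does not come from the original text.

```latex
\[
p(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}},
\qquad z = \mathbf{w}^{\top}\mathbf{x} + b
\]
\[
\operatorname{logit}(p) = \ln\frac{p}{1 - p} = \mathbf{w}^{\top}\mathbf{x} + b
\]
```

The second line is the inverse of the first: the log-odds are linear in the features, while the probability itself follows an S-shaped curve.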

In logistic regression, a decision threshold (commonly 0.5) converts the predicted probability into the final class label. The threshold should be chosen with care because it trades precision against recall: raising it typically increases precision and lowers recall, while lowering it does the opposite.
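
A minimal sketch of how the threshold changes precision and recall. The scores and labels are made-up values for illustration only, and `sigmoid` and `precision_recall` are hypothetical helper names, not part of any library.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def precision_recall(y_true, y_prob, threshold):
    """Precision and recall of the positive class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical linear scores (w.x + b) and true labels, invented for this demo.
scores = [-2.0, -0.5, 0.3, 0.8, 1.5, 3.0]
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [sigmoid(z) for z in scores]

for threshold in (0.3, 0.5, 0.7):
    prec, rec = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  precision={prec:.2f}  recall={rec:.2f}")
```

On this toy data, the lowest threshold catches every positive at the cost of more false positives, while the highest threshold removes the false positives but misses one true positive.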

Key Assumptions:
1. The dependent variable must be categorical.
2. The independent variables should not exhibit multicollinearity (strong correlation among predictors).
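
One simple way to check the second assumption is to inspect pairwise correlations between the predictors. The sketch below assumes the data is loaded through scikit-learn's `load_diabetes`; the 0.7 cutoff is an arbitrary illustrative value, not a standard.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, names = data.data, data.feature_names

# Pearson correlation between every pair of feature columns.
corr = np.corrcoef(X, rowvar=False)

# Collect feature pairs whose absolute correlation exceeds the cutoff.
CUTOFF = 0.7  # assumed value chosen for illustration
high_pairs = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > CUTOFF
]

for a, b, r in high_pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Pairwise correlation only detects collinearity between two predictors at a time; it is a quick screen rather than a complete diagnostic.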

Further Understanding: Logistic Regression

Algorithm

  1. Load the Dataset: Load the diabetes dataset.
  2. Binarize the Target: Convert the continuous target variable into a binary variable based on the median.
  3. Split the Dataset: Split the dataset into training and testing sets.
  4. Standardize Features: Standardize the feature values.
  5. Fit the Model: Initialize and fit the logistic regression model.
  6. Predict Labels: Predict the labels for the test set.
  7. Generate Report: Generate the classification report using 'classification_report'.
  8. Confusion Matrix: Generate and display the confusion matrix using 'confusion_matrix'.
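
The steps above can be sketched with scikit-learn as follows. Several details are assumptions rather than requirements of the experiment: values strictly above the median are labeled 1, the split is 80/20 with a fixed `random_state`, and the split is stratified on the binary label.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load the dataset.
X, y = load_diabetes(return_X_y=True)

# 2. Binarize the target: 1 if progression is above the median, else 0 (assumed rule).
y_binary = (y > np.median(y)).astype(int)

# 3. Split into training and testing sets (assumed 80/20 split, fixed seed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

# 4. Standardize features: fit the scaler on training data only, then apply to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5. Initialize and fit the logistic regression model (default settings).
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 6. Predict labels for the test set (uses the default 0.5 threshold).
y_pred = model.predict(X_test_scaled)

# 7. Classification report: precision, recall, F1 and support per class.
print(classification_report(y_test, y_pred))

# 8. Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step, so the reported metrics reflect unseen data.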

About Diabetes Dataset

Ten baseline variables — age, sex, BMI, average blood pressure, and six blood serum measurements — were obtained for each of n = 442 diabetes patients, along with the response variable indicating disease progression one year after baseline.

Data Set Characteristics

  • Number of Instances: 442
  • Number of Attributes: 10 numeric predictive variables
  • Target: Quantitative measure of disease progression

Attribute Information

  • age: age in years
  • sex: gender
  • bmi: body mass index
  • bp: average blood pressure
  • s1: tc, total serum cholesterol
  • s2: ldl, low-density lipoproteins
  • s3: hdl, high-density lipoproteins
  • s4: tch, total cholesterol / HDL
  • s5: ltg, possibly log of serum triglycerides level
  • s6: glu, blood sugar level
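
The characteristics listed above can be checked directly, assuming the dataset is obtained through scikit-learn's `load_diabetes` (the original text does not name the loader).

```python
from sklearn.datasets import load_diabetes

data = load_diabetes()

# Feature matrix: one row per patient, one column per baseline variable.
print(data.data.shape)

# Column names in the same order as the attribute list.
print(data.feature_names)

# First few values of the continuous disease-progression target.
print(data.target[:5])
```

Note that scikit-learn ships the feature columns already mean-centered and scaled, so the raw values do not look like ages in years or blood pressure readings.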

Source: Diabetes Dataset

Simulation

Predicting disease progression as a binary classification problem.

Pre-Lab Questions

  1. Define logistic regression and state where it is used.
  2. What is the difference between logistic and linear regression?

Post-Lab Questions

  1. What is a confusion matrix and what does it represent?
  2. Explain any one performance metric used in classification.

Result

The logistic regression model was successfully implemented using the diabetes dataset with a binarized target. It achieved an accuracy of ____________%, and the classification report and confusion matrix confirmed effective binary classification performance.